Extractor API #16

psrpinto · 2024-08-20T16:25:58Z

WIP

Doubts

Naming

Need better names for:

SiteData
SiteInfo?

What should happen if the extractor thinks it can handle the source but then can't?

I guess we'd tell the user something like "Extractor didn't find anything, try another one"? Or, instead of having the extractor say whether it supports a source, should we ask the user right away which extractor they want to use?

Should an extractor extract a single type of data?

For example, should the wordpress-rest extractor extract posts and pages, or should there be a wordpress-post-rest and a wordpress-page-rest extractor?

Having different extractors for different kinds of data would make it possible to have specific extractors for certain data types. For example, there could be a wordpress-product-rest extractor that extracts products from an e-commerce site.

What about multi-language sources?

We don't need to support it right away, but we should keep it in mind while designing the API.

Simplify directory structure

akirk

What should happen if the extractor thinks it can handle the source but then can't?

What about having extractors report a confidence between 0 and 100 that they can extract the source?
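A minimal sketch of how that could look, assuming a hypothetical confidence() method that takes the page URL. The ConfidenceExtractor interface, the rankExtractors() helper, the generic-html slug and the scores are illustrative assumptions, not part of this PR:

```typescript
// Hypothetical shape: none of these names come from the PR itself.
interface ConfidenceExtractor {
  slug: string;
  // Returns 0 (cannot handle the source) up to 100 (certain it can).
  confidence(url: string): number;
}

// Rank extractors from most to least confident, dropping those that report 0.
function rankExtractors(
  extractors: ConfidenceExtractor[],
  url: string,
): ConfidenceExtractor[] {
  return extractors
    .map((extractor) => ({ extractor, score: extractor.confidence(url) }))
    .filter(({ score }) => score > 0)
    .sort((a, b) => b.score - a.score)
    .map(({ extractor }) => extractor);
}

// Illustrative extractors with made-up scores.
const wordpressRestExample: ConfidenceExtractor = {
  slug: "wordpress-rest",
  confidence: (url) => (url.includes("/wp-json") ? 90 : 20),
};
const genericHtmlExample: ConfidenceExtractor = {
  slug: "generic-html",
  confidence: () => 10,
};

const ranked = rankExtractors(
  [genericHtmlExample, wordpressRestExample],
  "https://example.com/wp-json/",
);
```

With a ranked list, the UI could try the most confident extractor first and fall back to the next one if it finds nothing.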

Should an extractor extract a single type of data?

I think it should not be limited to a single type of data since parsing the DOM might give multiple pieces of data as side-effects. For example, followers and follows might be on the same page and thus could be extracted at the same time.

What concerns me more is how we'll register extractors, considering that we might end up with a large number of them. Maybe we can have a two-stage system where we'd only add a subset of extractors based on URL matching. WordPress, for example, could be added for all URLs except those that specifically match other matchers.

That way we wouldn't be registering all extractors on every page (since the content script loads into every page).
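Roughly, the first stage could be a cheap URL check that decides which extractors get registered at all. The sketch below assumes a hypothetical ExtractorEntry registry shape and a made-up example-social extractor; only the wordpress-rest slug comes from this PR:

```typescript
// Hypothetical registry entry: a URL pattern, or null for a fallback extractor.
interface ExtractorEntry {
  slug: string;
  matches: RegExp | null;
}

// Stage one: pick the extractors whose pattern matches the URL.
// Fallback entries are only used when no specific pattern matched.
function selectExtractors(entries: ExtractorEntry[], url: string): string[] {
  const specific = entries.filter(
    (entry) => entry.matches !== null && entry.matches.test(url),
  );
  if (specific.length > 0) {
    return specific.map((entry) => entry.slug);
  }
  return entries
    .filter((entry) => entry.matches === null)
    .map((entry) => entry.slug);
}

const registry: ExtractorEntry[] = [
  // Made-up site-specific extractor, for illustration only.
  { slug: "example-social", matches: /^https:\/\/social\.example\.com\// },
  // WordPress acts as the fallback for every URL not claimed above.
  { slug: "wordpress-rest", matches: null },
];

const forSocial = selectExtractors(registry, "https://social.example.com/@user");
const forBlog = selectExtractors(registry, "https://example.com/blog");
```

Stage two would then load and run only the selected extractors, so the content script never has to pull in the full set.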

What about multi-language sources? We don't need to support it right away, but we should keep it in mind while designing the API.

👍

akirk · 2024-08-23T11:50:05Z

src/extractor/source.ts

+ * Source of data to be extracted, like a DOM document, a URL or any other kind of resource.
+ * For the moment, only DOM Document is supported.


Could you elaborate on these other kinds of resources? The URL can be accessed through document.location.href.

The idea behind having a Source interface is that the API does not depend on a specific data structure that might not be available in all runtimes. If in the future we would like to run an extractor in Node.js, for example, the document would not exist (or at least would not have the same type).

Another reason would be that we can envision having extractors that don't rely on a document, but instead, for example, pull directly from a URL. (We could also make it so that an extractor can support multiple types of Sources, e.g. DOMSource and URLSource).

If we did not introduce the notion of a Source now, adding support for multiple types of sources later would be a breaking change to the API, which would require updating all existing extractors.
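To illustrate, here is a rough sketch of how multiple source types could coexist. DOMSource and URLSource are the names floated above, but the kind discriminator, the stand-in document type and the /wp-json check are assumptions for illustration, not what src/extractor/source.ts currently contains:

```typescript
// Hypothetical source types; for the moment only a DOM Document is supported.
interface DOMSource {
  kind: "dom";
  // Stand-in for the browser's Document so the sketch also runs outside a browser.
  document: { title: string };
}
interface URLSource {
  kind: "url";
  url: string;
}
type Source = DOMSource | URLSource;

interface SourceAwareExtractor {
  slug: string;
  // Each extractor decides which kinds of Source it can work with.
  supports(source: Source): boolean;
}

// Illustrative extractor that only works from a URL.
const urlOnlyExtractor: SourceAwareExtractor = {
  slug: "wordpress-rest",
  supports: (source) =>
    source.kind === "url" && source.url.includes("/wp-json"),
};

const supportsUrl = urlOnlyExtractor.supports({
  kind: "url",
  url: "https://example.com/wp-json/wp/v2/posts",
});
const supportsDom = urlOnlyExtractor.supports({
  kind: "dom",
  document: { title: "Example" },
});
```

Because extractors only see the Source type, adding a new variant later would not change the signature of supports().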

psrpinto · 2024-08-23T16:03:05Z

That way we wouldn't be registering all extractors on every page (since the content script loads into every page).

This is something I hadn't considered. I think we should probably only run the content scripts while the extension is open.

psrpinto · 2024-09-02T14:00:00Z

I will open a new PR to implement this when required.

psrpinto changed the base branch from trunk to simplify-directory-structure August 20, 2024 16:26

psrpinto force-pushed the extractor-api branch 3 times, most recently from 1bfda77 to 57cc696 Compare August 21, 2024 12:54

akirk force-pushed the simplify-directory-structure branch from 486ac09 to ebd0919 Compare August 21, 2024 12:56

psrpinto force-pushed the extractor-api branch from 57cc696 to 070b712 Compare August 21, 2024 13:29

psrpinto added 13 commits August 21, 2024 15:47

Merge pull request #15 from akirk/simplify-directory-structure

2d5b15a

Simplify directory structure

Add Extractor API

ea60764

Temporarily comment out playground boot

8c72ee0

Call extractor

ad143a3

Implement handles() of wordpress-rest extractor

d179748

Make extract method async

867321e

Rename meta to info

2910938

Rename Entry to SiteData

e3a25ed

Rename function to extractData()

6ef8e11

Add SiteInfo

33018f5

Add Source

c3783e0

Rename handles() to supports()

0eea4dd

Validate slug

0a331b7

psrpinto force-pushed the extractor-api branch from b1f5e62 to 0a331b7 Compare August 21, 2024 14:51

akirk reviewed Aug 23, 2024

psrpinto deleted the branch simplify-directory-structure September 2, 2024 13:57

psrpinto closed this Sep 2, 2024

psrpinto deleted the extractor-api branch October 7, 2024 15:14
